TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time
Is Android malware classification a solved problem? Published F1 scores of up
to 0.99 appear to leave very little room for improvement. In this paper, we
argue that results are commonly inflated due to two pervasive sources of
experimental bias: "spatial bias" caused by distributions of training and
testing data that are not representative of a real-world deployment; and
"temporal bias" caused by incorrect time splits of training and testing sets,
leading to impossible configurations. We propose a set of space and time
constraints for experiment design that eliminates both sources of bias. We
introduce a new metric that summarizes the expected robustness of a classifier
in a real-world setting, and we present an algorithm to tune its performance.
Finally, we demonstrate how this allows us to evaluate mitigation strategies
for time decay such as active learning. We have implemented our solutions in
TESSERACT, an open source evaluation framework for comparing malware
classifiers in a realistic setting. We used TESSERACT to evaluate three Android
malware classifiers from the literature on a dataset of 129K applications
spanning over three years. Our evaluation confirms that earlier published
results are biased, while also revealing counter-intuitive performance and
showing that appropriate tuning can lead to significant improvements.

Comment: This arXiv version (v4) corresponds to the one published at USENIX
Security Symposium 2019, with a fixed typo in Equation (4), which reported an
extra normalization factor of (1/N). The results in the paper and the
released implementation of the TESSERACT framework remain valid and correct,
as they rely on Python's numpy implementation of area under the curve.
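The temporal constraint described above (training data must strictly predate test data, avoiding "impossible configurations") can be sketched in a few lines. This is an illustrative sketch, not TESSERACT's released code; the function name, tuple layout, and parameters are assumptions for the example.

```python
def temporal_split(samples, test_start, test_end):
    """Enforce a time-aware split: every training sample must predate
    every test sample, so the classifier never sees "future" data.
    `samples` is a list of (timestamp, features, label) tuples; the
    layout is illustrative, not taken from the paper's framework."""
    train = [s for s in samples if s[0] < test_start]
    test = [s for s in samples if test_start <= s[0] <= test_end]
    return train, test
```

In contrast, a random shuffle of a multi-year dataset would mix future samples into the training set, which is exactly the temporal bias the paper warns about.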
Transcend: Detecting Concept Drift in Malware Classification Models
Building machine learning models of malware behavior is widely accepted as a panacea towards effective malware classification. A crucial requirement for building sustainable learning models, though, is to train on a wide variety of malware samples. Unfortunately, malware evolves rapidly and it thus becomes hard, if not impossible, to generalize learning models to reflect future, previously-unseen behaviors. Consequently, most malware classifiers become unsustainable in the long run, becoming rapidly antiquated as malware continues to evolve. In this work, we propose Transcend, a framework to identify aging classification models in vivo during deployment, well before the machine learning model's performance starts to degrade. This is a significant departure from conventional approaches that retrain aging models retrospectively when poor performance is observed. Our approach uses a statistical comparison of samples seen during deployment with those used to train the model, thereby building metrics for prediction quality. We show how Transcend can be used to identify concept drift based on two separate case studies on Android and Windows malware, raising a red flag before the model starts making consistently poor decisions due to out-of-date training.
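The "statistical comparison of samples seen during deployment with those used to train the model" can be illustrated with a conformal-style p-value: the fraction of calibration nonconformity scores at least as extreme as the new sample's. This is a minimal sketch of that idea, not Transcend's implementation; the function name and the choice of nonconformity measure are assumptions.

```python
def credibility(nonconformity_cal, nonconformity_new):
    """Conformal-style p-value for a new sample.

    `nonconformity_cal` holds nonconformity scores of held-out
    calibration samples (e.g. distance to the predicted class);
    `nonconformity_new` is the new sample's score. A value near 0
    means the sample looks unlike anything seen in training, which
    is the kind of red flag a drift detector would raise."""
    n = len(nonconformity_cal)
    at_least_as_extreme = sum(1 for a in nonconformity_cal
                              if a >= nonconformity_new)
    return (at_least_as_extreme + 1) / (n + 1)
```

A deployment-time monitor could then flag drift when the running average of these p-values drops below a chosen threshold, triggering retraining before accuracy visibly degrades.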
Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection
Malware evolves perpetually and relies on increasingly sophisticated attacks to supersede defense strategies. Data-driven approaches to malware detection run the risk of becoming rapidly antiquated. Keeping pace with malware requires models that are periodically enriched with fresh knowledge, commonly known as retraining. In this work, we propose the use of Venn-Abers predictors for assessing the quality of binary classification tasks as a first step towards identifying antiquated models. One of the key benefits behind the use of Venn-Abers predictors is that they are automatically well calibrated and offer probabilistic guidance on the identification of nonstationary populations of malware. Our framework is agnostic to the underlying classification algorithm and can then be used for building better retraining strategies in the presence of concept drift. Results obtained over a timeline-based evaluation with about 90K samples show that our framework can identify when models tend to become obsolete.
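A Venn-Abers predictor turns a classifier's raw score into a calibrated probability interval by fitting isotonic regression twice on a calibration set, once with the test object hypothetically labeled 0 and once labeled 1. The sketch below is a minimal, self-contained version of that procedure using pool-adjacent-violators; it is illustrative only and not Prescience's code, and the function names and data layout are assumptions.

```python
def pav(values):
    """Pool Adjacent Violators: non-decreasing isotonic fit to `values`.
    Each stack entry is a [sum, count] block; violating adjacent blocks
    (left mean > right mean) are merged until the means are monotone."""
    stack = []
    for v in values:
        stack.append([v, 1])
        while (len(stack) > 1 and
               stack[-2][0] / stack[-2][1] > stack[-1][0] / stack[-1][1]):
            s, c = stack.pop()
            stack[-1][0] += s
            stack[-1][1] += c
    out = []
    for s, c in stack:
        out.extend([s / c] * c)
    return out

def venn_abers(calibration, score):
    """calibration: list of (classifier_score, label) with label in {0,1}.
    Returns (p0, p1): the calibrated probability interval for class 1,
    obtained by isotonic regression with each hypothetical label."""
    interval = []
    for hypothetical in (0, 1):
        pts = sorted(calibration + [(score, hypothetical)])
        fit = pav([label for _, label in pts])
        interval.append(fit[pts.index((score, hypothetical))])
    return tuple(interval)
```

A wide gap between p0 and p1, or intervals that drift away from observed accuracy over time, is the kind of probabilistic signal the abstract describes for deciding when a model has become obsolete.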